# Vision-Language Pretraining
- **Blip Custom Captioning** — hiteshsatwani · BSD-3-Clause · Image-to-Text · 78 downloads · 0 likes
  BLIP is a unified vision-language pretraining framework that excels at vision-language tasks such as image caption generation.

- **Sail Clip Hendrix 10epochs** — cringgaard · Text-to-Image · Transformers · 49 downloads · 0 likes
  A vision-language model fine-tuned from openai/clip-vit-large-patch14 for 10 epochs.

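As a rough illustration of how CLIP-style checkpoints like the fine-tune above are typically used for zero-shot scoring: embeddings are L2-normalized, compared by dot product, scaled, and softmaxed. This is a minimal sketch with toy hand-written vectors standing in for real model outputs, not the model's actual API.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length before comparison."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def softmax(scores):
    """Convert raw similarity scores into probabilities."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def clip_scores(image_emb, text_embs, logit_scale=100.0):
    """Cosine similarity of one image embedding against several text
    embeddings, scaled and softmaxed -- the zero-shot scoring recipe
    popularized by CLIP. All inputs here are toy values."""
    img = l2_normalize(image_emb)
    sims = []
    for t in text_embs:
        t = l2_normalize(t)
        sims.append(logit_scale * sum(a * b for a, b in zip(img, t)))
    return softmax(sims)

# Toy 3-d embeddings; real CLIP embeddings are hundreds of dimensions.
image = [0.9, 0.1, 0.0]
texts = [[1.0, 0.0, 0.0],   # e.g. "a photo of a dog"
         [0.0, 1.0, 0.0]]   # e.g. "a photo of a cat"
probs = clip_scores(image, texts)
```

The temperature (`logit_scale`) sharpens the distribution; CLIP learns it during training rather than fixing it by hand.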
- **Vit So400m Patch14 Siglip 384.webli** — timm · Apache-2.0 · Image Classification · Transformers · 9,429 downloads · 0 likes
  Vision Transformer based on the SigLIP architecture, containing only the image encoder and using the original attention-pooling mechanism.

- **Vit Base Patch16 Siglip 512.webli** — timm · Apache-2.0 · Image Classification · Transformers · 702 downloads · 0 likes
  Vision Transformer based on the SigLIP architecture, containing only the image encoder and using the original attention-pooling mechanism.

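The "Patch16 … 512" naming above encodes the encoder's input geometry: a ViT cuts the image into non-overlapping square patches, and each patch becomes one token. A minimal sketch of that arithmetic (assuming the image side is a multiple of the patch size, as it is for the patch16/512 variant):

```python
def vit_token_count(image_size: int, patch_size: int) -> int:
    """Number of patch tokens a ViT image encoder produces when the
    image is cut into non-overlapping patch_size x patch_size squares."""
    if image_size % patch_size != 0:
        raise ValueError("image size must be a multiple of patch size")
    per_side = image_size // patch_size
    return per_side * per_side

# For the 512px, patch-16 encoder above: a 32x32 grid of patches.
tokens = vit_token_count(512, 16)  # 1024 patch tokens
```

Those patch tokens are what the attention-pooling head then condenses into a single image embedding.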
- **Minivla Vq Bridge Prismatic** — Stanford-ILIAD · MIT · Image-to-Text · Transformers · English · 22 downloads · 0 likes
  MiniVLA is a more compact yet higher-performing vision-language-action model, compatible with the Prismatic VLMs project codebase.

- **Biomedclip ViT Patch16 224** — ikim-uk-essen · MIT · Multimodal Fusion · Transformers · 1,296 downloads · 3 likes
  BiomedCLIP is a biomedical vision-language model developed by Microsoft, built on PubMedBERT and a ViT image encoder and designed specifically for the biomedical domain.

- **Image Captioning With Blip** — Vidensogende · BSD-3-Clause · Image-to-Text · Transformers · 16 downloads · 0 likes
  BLIP is a unified vision-language pretraining framework that excels at image caption generation, supporting both conditional and unconditional text generation.

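The distinction between conditional and unconditional captioning mentioned above comes down to whether decoding starts from a user-supplied prompt prefix or from scratch. A toy sketch of that idea using a hand-written bigram table and greedy decoding (purely illustrative; BLIP itself decodes with a full transformer language model):

```python
# Toy bigram "language model": maps each word to its most likely successor.
BIGRAMS = {
    "<start>": "a",
    "a": "dog",
    "dog": "on",
    "on": "grass",
    "grass": "<end>",
    "photo": "of",
    "of": "a",
}

def greedy_caption(prompt=None, max_len=8):
    """Greedy decoding from the bigram table. With no prompt, generation
    starts from <start> (unconditional captioning); with a prompt, it
    continues from the prompt's last word (conditional captioning)."""
    words = list(prompt.split()) if prompt else []
    current = words[-1] if words else "<start>"
    while len(words) < max_len:
        nxt = BIGRAMS.get(current, "<end>")
        if nxt == "<end>":
            break
        words.append(nxt)
        current = nxt
    return " ".join(words)

unconditional = greedy_caption()       # starts fresh
conditional = greedy_caption("photo")  # continues the given prefix
```

In the real model the prompt is tokenized and prepended to the decoder input, but the control flow is the same: the prefix conditions everything generated after it.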
- **Vilt Finetuned 200** — Atul8827 · Apache-2.0 · Text-to-Image · Transformers · 35 downloads · 0 likes
  Vision-language model based on the ViLT architecture, fine-tuned for a specific downstream task.

- **Llava V1.5 Mlp2x 336px Pretrain Vicuna 7b V1.5** — liuhaotian · Text-to-Image · Transformers · 173 downloads · 17 likes
  LLaVA is an open-source multimodal chatbot, fine-tuned from LLaMA/Vicuna on GPT-generated multimodal instruction-following data.

- **Image Caption Large Copy** — Sof22 · BSD-3-Clause · Image-to-Text · Transformers · 1,042 downloads · 10 likes
  BLIP is an advanced vision-language pretraining model that excels at image captioning by making effective use of noisy web data through its caption-bootstrapping strategy.

- **OTTER MPT7B Init** — luodian · MIT · Text-to-Image · Transformers · 53 downloads · 3 likes
  OTTER-MPT7B-Init is a set of weights for initializing Otter model training, converted directly from OpenFlamingo.

- **Blip Test** — mooncakex · BSD-3-Clause · Image-to-Text · Transformers · 15 downloads · 0 likes
  Image-caption generation model fine-tuned from Salesforce/blip-image-captioning-base.

- **Pix2struct Large** — google · Apache-2.0 · Image-to-Text · Transformers · Multilingual · 6,601 downloads · 34 likes
  Pix2Struct is an image-encoder/text-decoder model trained on image-text pairs, suitable for a variety of vision-language tasks.

- **Blip Image Captioning Base Football Finetuned** — ybelkada · BSD-3-Clause · Image-to-Text · Transformers · 71 downloads · 2 likes
  A vision-language model pre-trained on COCO and fine-tuned on a football dataset for generating image captions.